Genetic Programming for Multi-Timescale Modeling 
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A bottleneck for multi-timescale dynamics is the computation of the potential energy surface 
(PES). We explore the use of genetic programming (GP) to symbolically regress a mapping of the 
saddle-point barriers from only a few calculated points via molecular dynamics, thereby avoiding 
explicit calculation of all the barriers. The GP-regressed barrier function enables use of kinetic 
Monte Carlo (KMC) to simulate real-time kinetics (seconds to hours) using realistic interactions. 
To illustrate, we apply a GP regression to vacancy-assisted migration on a surface of a binary alloy 
and predict the diffusion barriers within 0.1-1% error using 3% (or less) of the barriers, and discuss 
the significant reduction in CPU time. 

PACS numbers: 02.60.-x, 02.70.Wz, 31.50.-x, 68.35.Fx 



Molecular dynamics (MD) is extensively used for ki- 
netic modeling of materials. Yet MD methods are limited 
to nanoseconds of real time, and hence fail to model di- 
rectly many processes. Recently several approaches were 
proposed for multiscaling USHSEHHHUlIi 
[llj. Methods such as temperature-accelerated dynam- 
ics (TAD) provide significant acceleration of MD but 
they still fall 3-6 orders of magnitude short of real pro- 
cessing times. These methods assume that transition- 
state theory applies, and concentrate only on infrequent 
events. An alternative approach to bridge timescales Q 
uses kinetic Monte Carlo (KMC) [13 combined with MD 
by constructing an a priori list of events (i.e., "look-up 
table"). The table look-up KMC yields several orders 
of magn itude increase in simulated time over MD (e.g., 
see |22j). The table of events is commonly comprised 
of atomic jumps, but collective motions (or off-lattice 
jumps), e.g., see Q, may need to be added but may 
not be known a priori. Additionally, tabulating bar- 
rier energies from a list of events is a serious limitation. 
For example, multicomponent alloys have an impossibly 
large set of barriers, due to configurational dependence, 
making their tabulation impractical, especially from first- 
principles. An alternative approach is calculating ener- 
gies "on-the-fly" 0, 0] , but it too has serious time limi- 
tation (see Fig.^). Recent developments and limitations 
of KMC methods are given, e.g., in [l3| . 

Symbolically-Regressed Table KMC (sr-KMC): To 
avoid the need or expense of explicit calculation of all 
activation barriers — frequent or infrequent — and thereby 
facilitate an effective hybridization of MD and KMC for 
multiscale dynamics modeling, we utilize genetic pro- 
gramming (GP) — a genetic algorithm that evolves com- 
puter programs — to symbolically regress the PES (in our 
case saddle-point barriers only) from a limited set of cal- 
culated points on the PES. An accurate GP-regressed 
PES extends the KMC paradigm, as suggested in Fig. ^ 
to machine learn the "look-up table" and get an in-line 



FIG. 1: Schematic illustration of simulation capabilities 
and bottlenecks of on-the-fly KMC, table look-up KMC, and 
symbolically-regressed KMC (sr-KMC). 



barrier function for increasing number of active configu- 
rations (or complexity) and providing simulation over ex- 
perimentally relevant time frames, which may not be pos- 
sible from standard table look-up or on-the-fly KMC. In- 
terfacing GP with TAD-MD and/or pattern-recognition 
methods will further extend its applicability, e.g., by find- 
ing system-specific mechanisms. Of course, sr-KMC ben- 
efits from any advances in KMC methods. In addition, 
GP-based symbolic regression holds promise in other 
multiscaling areas, e.g., regressing constitutive rules and 
chemical reaction pathways, which we are studying. Also, 
as we exemplify, standard basis-set regression are gener- 
ally not competitive to GP for fixed accuracy due to the 
difficulty in choosing appropriate basis functions. 

To demonstrate, we discuss GP and its application to 
a non-trivial case of vacancy-assisted migration on (100) 
surface of phase-separating Cu^Coi-a;. Although there 
are millions of configurations, only the environmental 
atoms locally around vacancy and migrating atom signif- 
icantly influence the barrier energies. We refer to these 
as the active configurations. The results show that GP 
predicts barriers within 0.1-1% error using calculated 
barriers of less than 3% of the total active configura- 
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FIG. 2: Illustration of tree representation, subtree crossover, 
subtree mutation, and point mutation used in GP. 
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FIG. 3: Sketch of simulation cell for vacancy- assisted mi- 
gration on (lOO)-surface of an fee binary alloy. Atoms in all 
but the bottom layers and the boundary can fully relax. The 
solid (dashed) lines around the migrating atom and vacancy 
represent 1 st (2 nd ) n.n environmental atoms. 



tions. For alloys, this technique can be combined with 
a local cluster expansion technique that reduces the 
explicit barrier calculations to ~0.3% of the active con- 
figurations. Our initial results hold promise to enable the 
use of KMC (even with realistic potentials) for increased 
problem complexity with a scale-up of simulation time. 

Genetic programming [lij is a genetic algorithm that 
evolves computer programs. The program is represented 
by a tree consisting of functions in the internal nodes and 
terminals in the leaf nodes (Fig. Here we use the 

function set T = {+, — , *, /, exp, sin} and the terminal 
set T = {x, 1Z}, where x is a vector representing the 
active alloy configuration, and 1Z is an ephemeral random 
constant |15|. Since we use GP for predicting the barriers, 
a tree represents a PES-prediction function that takes 
a configuration and ephemeral constants as inputs and 
returns the barrier for that configuration as output. 

A tree's quality is given by its fitness /. For this, we 
calculate the barriers {AE ca i c (xi) , • ■ • , AE'caic (%)} for 
M random configurations {xi, a?2, • • • , %m }■ These con- 
figurations are used as inputs to the tree and the barriers 
{Ai5p re( j (a?i) , ■ • ■ , A£p r ed (%m)}, are predicted. The fit- 
ness is then computed as a weighted average of the abso- 
lute error between the predicted and calculated barriers: 



i=l 

with Wi — | AS ca i c | _ , which gives preference to accu- 
rately predicting lower energy (most significant) events. 

Unlike traditional search methods, GP uses a popu- 
lation of candidate solutions (PES prediction functions) 
that arc initially created using the ramped half-and-half 
method |l5l | . Once the population is initialized and eval- 
uated, the following genetic operators are repeatedly ap- 
plied till one or more convergence criteria are satisfied: 



Selection: allocates more copies to solutions with bet- 
ter fitness values. We use an s-wise tournament selection 
[lif . where s candidate solutions are randomly chosen 
and pitted against each other in a tournament. A solu- 
tion with the best fitness wins. 

Recombination combines bits and pieces of two solu- 
tions to create new, hopefully better, solutions. We use 
subtree crossover |15j , where a crossover point for each so- 
lution is randomly chosen and subtrees below the point 
are swapped to create two new solutions, see Fig. |2J 
Mutation locally but randomly modifies a solution. We 
use two mutation techniques (see Fig. El) : Subtree muta- 
tion, where a subtree is randomly replaced with another 
randomly created subtree, and point mutation where a 
node is randomly modified. 

Case Study: We consider the prediction of diffusion 
barriers for vacancy-assisted migration on (100) surface 
of phase-separating Cu^Coi-z. The system consists of 
five layers with 100 to 625 atoms in each layer (see Fig. 

The bottom three layers are held fixed to their bulk 
bond distances, while the top layers are either held fixed 
(as a test) or fully relaxed via MD. We consider only 
first and second nearest-neighbor (n.n.) jumps, along 
with 1 st (as a test) and 2 nd n.n. environmental atoms 
in the active configuration, as shown in Fig. [3] This 
system already exhibits large complexity and is still small 
enough so that table look-up and GP-regressed KMC can 
be implemented and directly compared. Table[I]gives the 
number of active configurations when 1 st and 2 nd n.n. 
environments are considered for a binary alloy. 

We model the atomic interactions with a simple Morse 
potential 01 and a tight-binding potential with second- 
moment approximation (TB-SMA) 18]. To validate 
interactions, we model vacancy-assisted migration on 
(lOO)-surface of Cu and consider only first n.n. jumps. 
The predicted barrier for n.n. vacancy jumps with fully 
relaxed lattice in Cu is 0.39 eV for Morse and 0.45 eV 
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FIG. 4: Activation energies (in eV) predicted by regression. 
GP (circles) and a quadratic polynomial (crosses) are com- 
pared to the calculated (Morse) barriers for 1 st n.n. jumps on 
(lOO)-surface of CU0.5C00.5 for relaxed lattices. As a simple 
test, only first n.n. environments are considered in the active 
configuration. The line is a guide for the eye. 



FIG. 5: (Upper) Calculated vs. GP-predicted barriers (in 
eV) for 2 nd n.n. jumps on relaxed (lOO)-surface of Cuo.sCoo.5 
active configurations up to 2 nd n.n. for Morse and TB-SMA. 
(Lower) GP predicts the barriers with 0.1% (1%) error for 
most- (less-) significant events with AE < 4.8 eV (AE > 
4.8 eV) from only 3% of active configurations. 



for TB-SMA, agreeing with 0.42±0.08 (0.47±0.05) from 
ab initio (EAM) 0] calculations. 

We now consider the barrier regression via GP 
for vacancy-assisted migration on (lOO)-surface of 
Cu x Coi_2;. The input to the barrier regression (i.e., pre- 
diction) function, x = {xj} is a binary-encoded vector 
sequence, where Xj = (1) represents a Cu (Co) atom. 
For simplicity, we begin by considering only seven 1 st n.n. 
environmental atoms yielding 128 active configurations. 
About 20, i.e., 16%, different active configurations are 
randomly chosen and their barriers are computed using 
the conjugate-gradient method and are used in the GP 
fitness function, see Eq. ^ The barriers predicted by 
GP for the relaxed configurations are compared to the 
exact values in Fig. ^ We note that the prediction er- 
ror for rigid lattice case (0.4±0.04%) is significantly less 
than that for relaxed lattice case (2.8±0.08%). Due to 
the weighting used in the fitness function, GP predicts 
barriers for most significant events more accurately than 
for less-significant (higher-energy) events. 

Figure0]also compares the barriers predicted by GP to 
those predicted by a least-squares fit quadratic polyno- 
mial, showing clearly its inadequacy for alloys. Further- 



TABLE I: Number of active configurations for 1 st and 2 nd 
n.n. jumps, and for 1 st and 2 nd n.n. active atoms 







1 st n.n. jumps 


2 nd n.n. jumps 


-J^St 


n.n. active configurations 


128 


128 


2 nd 


n.n. active configurations 


2048 


8192 




Total configurations 


» 2 100 


» 2 100 



more, while GP requires only 16%, the quadratic (cubic) 
polynomial fit needs 27% (78%) of the barriers. In lim- 
ited cases, such as dilute Fei-^Cua;, the barriers can be 
predicted via a simple polynomial fit |2o|. 

To test the scalability of GP with active configuration 
size, we consider the 2 nd n.n. jumps and 1 st and 2 nd 
n.n. environmental atoms in the active configuration. As 
shown in Table|U there are a total of 8192 configurations. 
The energies predicted by GP are compared with direct 
calculations in Fig. along with the error. The GP 
predicts the barriers for most significant events with less 
than 0.1% error by fitting to energies from only 3% (i.e., 
256/8192) of the active configurations (with error defined 
in |23j). In comparison, a cubic polynomial fit requires 
energies for ~6% of the configurations, predicting the 
barriers with 2.5% error for the most significant events. 

The results shown in Figs. 0] and [5] clearly demonstrate 
the effectiveness of GP in predicting the potential en- 
ergy surface, with high accuracy and little information. 
As expected, since the regression and barrier calculation 
are nearly independent, the GP performance does not 
depend on the potentials used, e.g., Fig. |S] shows re- 
sults for both Morse, and non-additive and non-linear 
tight-binding potentials. The regression only requires a 
database of barriers and has no knowledge (nor the need) 
of the underlying potential used. We also find that the 
GP performance is independent of the configuration set 
used in calculating the fitness function, the order in which 
they are used, and the labeling scheme used to convert 
the configuration into a vector of inputs. Differences in 
activation-energy scale on the PES prediction via GP is 
also negligible. That is, even though the barriers for the 
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1 st and 2 nd n.n. jumps differ by an order of magnitude, 
GP predicts the barriers with similar accuracy. More- 
over, for more complex, cooperative effects, such as island 
diffusion via surface dislocations [2]]], sr-KMC could be 
interfaced with pattern-recognition methods [24| (see [2{| 
for long-range fields). 

Time enhancements by coupling a GP-regressed barri- 
ers with KMC (or sr-KMC) are simple to estimate. For 
our example, with ~33 times fewer calculated barriers 
GP symbolically regresses an in-line barrier function — 
rather than the complete look-up table — and thus, sr- 
KMC provides a direct CPU savings of ~100 over ta- 
ble look-up methods. Additionally, each time step of 
sr-KMC requires only CPU-seconds for an in-line 

function evaluation, as opposed to on-the-fly KMC which 
require seconds (empirical potentials) to hours (quantum 
methods), thus providing a gain of 10 4 -10 7 CPU-seconds. 
For our example, one relaxed barrier calculation takes 
-10 sees (-1800 sees) for Morse (TB-SMA). One impor- 
tant question, especially for bulk diffusion, is how the 
gain from sr-KMC scales with system complexity. While 
we cannot fully answer this question yet, in the present 
study it is remarkable and promising that the fraction 
of explicit barrier calculations required by sr-KMC de- 
creases as the number of active configurations increases. 

To summarize, potential energy surface prediction us- 
ing symbolic regression via genetic programming (GP) 
holds promise as an efficient tool for multi-scaling in dy- 
namics. The GP based KMC approach avoids the need 
or expense of calculating the entire potential-energy sur- 
face, is highly accurate, and leads to significant scale-up 
in simulation time and a reduction in CPU time. We 
have shown on a non-trivial example of vacancy-assisted 
migration on a surface of Cu^Coi-a; that GP predicts 
all barriers with 0.1-1% error from calculations for only 
3% of active configurations, allowing seconds of simula- 
tion time. For alloy problems, the number of direct bar- 
rier calculations can further be reduced by over an order 
of magnitude by hybridizing GP with cluster expansion 
methods We emphasize that the GP is non-trivially 
regressing a function and its coefficients that approxi- 
mates the potential-energy surface, and its efficacy over 
standard basis-set regression is clear. Moreover, GP ap- 
proach is not problem specific and requires little modifi- 
cation, if any (say by choice of operators and functions), 
to address increasingly complex cases, and as suggested, 
is potentially useful in other multiscaling areas. 
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